1 Introduction



Concurrent data structures are the data sharing side of parallel programming. 
Data structures give the means to the program to store data but also provide op- 
erations to the program to access and manipulate these data. These operations 
are implemented through algorithms that have to be efficient. In the sequen- 
tial setting, data structures are crucially important for the performance of the 
respective computation. In the parallel programming setting, their importance 
becomes more crucial because of the increased use of data and resource shar- 
ing for utilizing parallelism. In parallel programming, computations are split 
into subtasks in order to introduce parallelization at the control/computation 
level. To utilize this opportunity of concurrency, subtasks share data and var- 
ious resources (dictionaries, buffers, and so forth). This makes it possible for 
logically independent programs to share various resources and data structures. 
When a subtask wants to update a data structure, say add an element to a
dictionary, that operation may be logically independent of other subtasks that
use the same dictionary.

Concurrent data structure designers are striving to maintain consistency 
of data structures while keeping the use of mutual exclusion and expensive 
synchronization to a minimum, in order to prevent the data structure from 
becoming a sequential bottleneck. Maintaining consistency in the presence of 
many simultaneous updates is a complex task. Standard implementations of 
data structures are based on locks in order to avoid inconsistency of the shared
data due to concurrent modifications. In simple terms, a single lock around the 
whole data structure may create a bottleneck in the program where all of the 
tasks serialize, resulting in a loss of parallelism because too few data locations 
are concurrently in use. Deadlocks, priority inversion, and convoying are also 
side-effects of locking. The risk for deadlocks makes it hard to compose different 
blocking data structures since it is not always possible to know how closed 
source libraries do their locking. It is worth noting that in graphics processors
(GPUs) locks are not recommended for designing concurrent data structures.
GPUs prior to the NVIDIA Fermi architecture do not have writable caches,
so for those GPUs, repeated checks to see if a lock is available or not require
expensive repeated accesses to the GPU's main memory. While Fermi GPUs do
support writable caches, there is no guarantee that the thread scheduler will be
fair, which can make it difficult to write deadlock-free locking code. OpenCL
explicitly disallows locks for these and other reasons.

Lock-free implementations of data structures support concurrent access. 
They do not involve mutual exclusion and make sure that all steps of the sup- 
ported operations can be executed concurrently. Lock-free implementations 
employ an optimistic conflict control approach, allowing several processes to 
access the shared data object at the same time. They suffer delays only when 
there is an actual conflict between operations that causes some operations to 
retry. This feature allows lock-free algorithms to scale much better when the 
number of processes increases. 
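
The optimistic approach described above can be illustrated with a minimal C++ sketch, assuming C++11 std::atomic. The names store_max and run_store_max_demo are hypothetical and chosen only for this illustration. Each thread reads the shared word, computes its update, and retries only if its compare-and-swap fails.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: atomically replaces x with max(x, candidate).
// Returns how many times the CAS had to be retried. A retry happens when
// another thread changed x in between, or when compare_exchange_weak
// fails spuriously (which the C++ standard permits).
int store_max(std::atomic<int>& x, int candidate) {
    int retries = 0;
    int seen = x.load();
    while (seen < candidate) {
        // Writes candidate only if x still equals seen; on failure the
        // current value of x is loaded into seen and the loop re-checks.
        if (x.compare_exchange_weak(seen, candidate)) break;
        ++retries;
    }
    return retries;
}

// Hypothetical driver: num_threads threads race to publish 10, 20, ...
// No thread ever blocks another; the final value is the largest proposal.
int run_store_max_demo(int num_threads) {
    std::atomic<int> shared{0};
    std::vector<std::thread> workers;
    for (int t = 1; t <= num_threads; ++t)
        workers.emplace_back([&shared, t] { store_max(shared, t * 10); });
    for (auto& w : workers) w.join();
    return shared.load();
}
```

A failed CAS in this loop means that some other thread's update succeeded, so the system as a whole makes progress, which is the lock-free guarantee.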

An implementation of a data structure is called lock-free if it allows multiple
processes/threads to access the data structure concurrently and also guarantees
that at least one operation among those finishes in a finite number of its own
steps regardless of the state of the other operations. A consistency (safety)
requirement for lock-free data structures is linearizability [44], which ensures that
each operation on the data appears to take effect instantaneously during its
actual duration and that the effect of all operations is consistent with the object's
sequential specification. Lock-free data structures offer several advantages over
their blocking counterparts, such as being immune to deadlocks, priority
inversion, and convoying, and have been shown to work well in practice in many
different settings [HH [HI] . They have been included in Intel's Threading
Building Blocks Framework [7S|, the NOBLE library [M] and the Java concurrency
package [56], and will be included in the forthcoming parallel extensions to the
Microsoft .NET Framework [69]. They have also been of interest to designers
of languages such as C++ [H] and Java [5S].

This chapter has two goals. The first and main goal is to provide sufficient
background and intuition to help the interested reader navigate the complex
research area of lock-free data structures. The second goal is to give the
programmer enough familiarity with the subject to use truly concurrent
methods.

The chapter is structured as follows. First we discuss the fundamental and 
commonly-supported synchronization primitives on which efficient lock-free data 
structures rely. Then we give an overview of the research results on lock-free 
data structures that appeared in the literature with a short summary for each of 
them. The problem of managing dynamically-allocated memory in lock-free con- 
current data structures and general concurrent environments is discussed sepa- 
rately. Following this is a discussion on the idiosyncratic architectural features 
of graphics processors that are important to consider when designing efficient 
lock-free concurrent data structures for this emerging area. 

2 Synchronization Primitives 

To synchronize processes efficiently, multi-/many-core systems usually support 
certain synchronization primitives. This section discusses the fundamental syn- 
chronization primitives, which typically read the value of a single memory word, 
modify the value and write the new value back to the word atomically. 

2.1 Fundamental synchronization primitives 

The definitions of the primitives are given in Figure 1, where x is a memory
word, v, old, and new are values, and op can be one of the operators add, sub, or,
and, and xor. Operations between angle brackets ⟨ ⟩ are executed atomically.

TAS(x) /* test-and-set, init: x ← 0 */
⟨ oldx ← x; x ← 1; return oldx; ⟩

FAO(x, v) /* fetch-and-op */
⟨ oldx ← x; x ← op(x, v); return oldx; ⟩

CAS(x, old, new) /* compare-and-swap */
⟨ if (x = old) { x ← new; return(true); }
  else return(false); ⟩

LL(x) /* load-linked */
⟨ return the value of x so that
  it may be subsequently used
  with SC ⟩

SC(x, v) /* store-conditional */
⟨ if (no process has written to x
  since the last LL(x)) { x ← v;
  return(true) };
  else return(false); ⟩

Figure 1: Synchronization primitives

Note that a problem known as the ABA problem may occur with
the CAS primitive. The reason is that the CAS operation cannot detect if a
variable was read to be A and then later changed to B and then back to A by
some concurrent processes. The CAS primitive will perform the update even
though this might not be intended by the algorithm's designer. The LL/SC
primitives can instead detect any concurrent update on the variable within
the time interval of an LL/SC pair, independent of the value of the update.
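
As an illustration of how these primitives surface in a mainstream language, the following sketch maps TAS, FAO (with op = add) and CAS onto C++11 std::atomic operations. The wrapper names tas, fao_add, cas and aba_goes_unnoticed are hypothetical. Standard C++ offers no direct LL/SC operation, so none is shown. The last function replays the ABA interleaving in a single thread, purely to show that the final CAS succeeds.

```cpp
#include <atomic>
#include <cassert>

// TAS: sets the flag and returns its previous state.
bool tas(std::atomic_flag& x) { return x.test_and_set(); }

// FAO with op = add: adds v to x and returns the old value of x.
int fao_add(std::atomic<int>& x, int v) { return x.fetch_add(v); }

// CAS: writes desired only if x currently equals expected.
bool cas(std::atomic<int>& x, int expected, int desired) {
    return x.compare_exchange_strong(expected, desired);
}

// Scripted ABA interleaving, run sequentially for illustration only.
// "Process 1" reads A, "process 2" changes A -> B -> A, and the CAS of
// process 1 still succeeds because it compares values, not histories.
bool aba_goes_unnoticed() {
    const int A = 1, B = 2, C = 3;
    std::atomic<int> x{A};
    int seen = x.load();     // process 1 reads A
    cas(x, A, B);            // process 2: A -> B
    cas(x, B, A);            // process 2: B -> A
    return cas(x, seen, C);  // process 1: succeeds despite the updates
}
```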

2.2 Synchronization power 

The primitives are classified according to their synchronization power or
consensus number [57], which is, roughly speaking, the maximum number of processes
for which the primitives can be used to solve a consensus problem in a
fault-tolerant manner. In the consensus problem, a set of n asynchronous processes,
each with a given input, communicate to achieve an agreement on one of the
inputs. A primitive with a consensus number n can achieve consensus among n
processes even if up to n - 1 processes stop.

According to the consensus classification, read/write registers have consensus
number 1, i.e. they cannot tolerate any faulty processes in the consensus setting.
There are some primitives with consensus number 2 (e.g. test-and-set (TAS) and
fetch-and-op (FAO)) and some with infinite consensus number (e.g.
compare-and-swap (CAS) and load-linked/store-conditional (LL/SC)). It has been proven
that a primitive with consensus number n cannot implement a primitive with a
higher consensus number in a system of more than n processes [57]. For example,
the test-and-set primitive, whose consensus number is two, cannot implement
the compare-and-swap primitive, whose consensus number is unbounded, in a
system of more than two processes.
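
To make the unbounded consensus number of CAS concrete, here is a minimal sketch of consensus among any number of threads using a single CAS each. The class CasConsensus, the sentinel kUndecided, and the driver all_agree are hypothetical names for this illustration, and proposals are assumed to differ from the sentinel.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sentinel meaning "nothing decided yet"; proposals must differ from it.
constexpr int kUndecided = -1;

class CasConsensus {
public:
    // Every caller returns the same value: the proposal whose CAS came first.
    int decide(int proposal) {
        int expected = kUndecided;
        // Only the first CAS can succeed. A failed CAS leaves the already
        // decided value in expected, so the loser adopts the winner's input.
        if (decided_.compare_exchange_strong(expected, proposal)) return proposal;
        return expected;
    }
private:
    std::atomic<int> decided_{kUndecided};
};

// Hypothetical driver: n threads propose 100, 101, ... and we check that
// all of them decided on one and the same proposed value.
bool all_agree(int n) {
    CasConsensus c;
    std::vector<int> results(n, kUndecided);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back([&c, &results, i] { results[i] = c.decide(100 + i); });
    for (auto& t : threads) t.join();
    for (int r : results)
        if (r != results[0]) return false;
    return results[0] >= 100 && results[0] < 100 + n;
}
```

Each call to decide finishes after one CAS regardless of what the other threads do, so a thread that stops never prevents the others from deciding.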

2.3 Scalability and Combinability 

As many-core architectures with thousands of cores are expected to be our
future chip architectures [5], synchronization primitives that can support
scalable thread synchronization for such large-scale architectures are desired. In
addition to the synchronization power criterion, synchronization primitives can be
classified by their scalability or combinability [54]. Primitives are combinable
if their memory requests to the same memory location (arriving at a switch of
the processor-to-memory interconnection network) can be combined into only
one memory request. Separate replies to the original requests are later created
from the reply to the combined request (at the switch). The combining
technique has been implemented in the NYU Ultracomputer [3D| and the IBM RP3
machine [73], and has been shown to be a scalable technique for large-scale
multiprocessors to alleviate the performance degradation due to a synchronization
"hot spot". The set of combinable primitives includes test-and-set, fetch-and-op
(where op is an associative operation or boolean operation), blocking full-empty
bits [SU and non-blocking full-empty bits [3B]. For example, two consecutive
requests fetch-and-add(x, a) and fetch-and-add(x, b) can be combined into a single
request fetch-and-add(x, a + b). When receiving a reply oldx to the combined
request fetch-and-add(x, a + b), the switch at which the requests were combined
creates a reply oldx to the first request fetch-and-add(x, a) and a reply (oldx + a)
to the successive request fetch-and-add(x, b).
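
The fetch-and-add example above can be traced with a small sequential model. This is only a sketch of the arithmetic performed at the switch, not of real interconnect hardware; the function names memory_fetch_and_add and combined_fetch_and_add are hypothetical.

```cpp
#include <cassert>
#include <utility>

// Sequential model of one memory word serving a fetch-and-add request:
// returns the old value of x and adds v to it.
int memory_fetch_and_add(int& x, int v) {
    int old = x;
    x += v;
    return old;
}

// Models a switch that combines fetch-and-add(x, a) and fetch-and-add(x, b)
// into the single request fetch-and-add(x, a + b) and then derives the two
// separate replies from the one reply it receives.
std::pair<int, int> combined_fetch_and_add(int& x, int a, int b) {
    int oldx = memory_fetch_and_add(x, a + b);  // one memory request, not two
    return {oldx, oldx + a};  // reply to the first and to the second request
}
```

The two replies are exactly those that the requests would have received had they reached memory one after the other, which is why combining is invisible to the requesting processors.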

The CAS primitives are not combinable since the success of a CAS(x, a, b)
primitive depends on the current value of the memory location x. For m-bit
locations (e.g. 64-bit words), there are 2^m possible values and therefore, a
combined request that represents k CAS(x, a, b) requests, k < 2^m, must carry as
many as k different checking-values a and k new values b. The LL/SC primitives
are not combinable either since the success of an SC primitive depends on the
state of its reservation bit at the memory location that has been set previously by
the corresponding LL primitive. Therefore, a combined request that represents
k SC requests (from different processes/processors) must carry as many as k
store values.

2.4 Multi-word Primitives 

Although the single-word hardware primitives are conceptually powerful enough 
to support higher-level synchronization, from the programmer's point of view 
they are not as convenient as multi-word primitives. The multi-word primitives 
can be built in hardware [52] [TBI EIj or in software (in a lock-free manner)
using single-word hardware primitives [3l [HI [SH [50] [7T1 [79]. Sun's third
generation chip-multithreaded (CMT) processor called Rock is the first processor
supporting transactional memory in hardware jllj . The transactional memory
is supported by two new instructions, checkpoint and commit, in which
checkpoint denotes the beginning of a transaction and commit denotes the end of
the transaction. If the transaction succeeds, the memory accesses within the
transaction take effect atomically. If the transaction fails, the memory accesses 
have no effect. 

Another emerging construct is the Advanced Synchronization Facility (ASF) , 
an experimental AMD64 extension that AMD's Operating System Research 
Center develops to support lock-free data structures and software transactional 
memory [16]. ASF is a simplified hardware transactional memory in which all
memory objects to be protected should be statically specified before transaction
execution. Processors can protect and speculatively modify up to 8 memory
objects of cache-line size. There is also research aiming at identifying
new efficient and powerful primitives, with the non-blocking full/empty
bit (NB-FEB) being an example that was shown to be as powerful as CAS or
LL/SC [Sg.

3 Lock-Free Data Structures 

The main criterion by which the various implementations of lock-free data
structures in the literature can be classified is the abstract data type
that each intends to implement. For each abstract data type there are usually
numerous implementations, each motivated by some specific purpose and
each fulfilling a number of properties to different degrees. As many of these
properties are orthogonal, a specific implementation often strengthens one or
more properties at the cost of others. Some of the most important properties
that differentiate the various lock-free data structure implementations in the
literature are:

Semantic fulfillments: Due to the complexity of designing lock-free data
structures it might not be possible to support all operations normally associated with
a certain abstract data type. Hence, some algorithms omit a subset of the
normally required operations and/or support operations with modified semantics.

Time complexity: Whether an operation can terminate in a time (without
considering concurrency) that is linearly or logarithmically related to e.g. the
size of the data structure can have a significant impact on performance.
Moreover, whether the maximum execution time can be determined at all, or if it
can be expected in relation to the number of concurrent threads, is of significant
importance to time-critical systems (e.g. real-time systems).

Scalability: Scalability means showing some performance gain with an
increasing number of threads. Synchronization primitives are normally not scalable
in themselves; therefore it is important to avoid unnecessary synchronization.
Israeli and Rappoport [50] have defined the term disjoint-access-parallelism to
identify algorithms that do not synchronize on data that is not logically involved
simultaneously in two or more concurrent operations.

Dynamic capacity: In situations where it can be difficult to determine the
maximum number of items that will be stored in a data structure, it is necessary
that the data structure can dynamically allocate more memory when the current
capacity is about to be exceeded. If the data structure is based on statically
allocated storage, capacity is fixed throughout the lifetime of the data structure.

Space complexity: Some algorithms can guarantee an upper bound on the memory
required, while others can transiently need an unbounded amount depending
on the invocation order of the concurrent operations, so that their memory
requirement cannot be determined in advance.

Concurrency limitations: Due to the limitations (e.g. consensus number) of
the chosen synchronization primitives, some or all operations might not allow
more than a certain number of concurrent invocations.

Synchronization primitives: Contemporary multi-core and many-core
systems typically only support single-word CAS or weak and non-nestable variants
of LL/SC (cf. section [5]). However, many algorithms for lock-free data
structures depend on more advanced primitives such as double-word CAS (called e.g.
DCAS or CAS2), ideal LL/SC or even more complex primitives. These algorithms
then need (at least) one additional abstraction layer for actual implementation,
where these more advanced primitives are implemented in software using
another specific algorithm. The LL/SC primitives can be implemented e.g. by
CAS [501 m [701 EH [65]. Multi-word CAS (called e.g. MWCAS or CASN) can be
implemented e.g. by CAS [MllSl] or by LL/SC [SOI H [Zi [HI IM] .

Reliability: Some algorithms try to avoid the ABA problem by means of
e.g. version counters. As these counters are bounded and can overflow, there is a
potential risk that the algorithm performs incorrectly and possibly causes
inconsistencies. Normally, by design this risk can be kept low enough for
practical purposes, although the risk increases as the computational speed
increases. Often, version counters can be removed altogether by means of
proper memory management.

Compatibility and Dependencies: Some algorithms only work together with
certain memory allocators and reclamation schemes, specific types (e.g.
real-time) of system-level process scheduler, or require software layers or semantic
constructions only found in certain programming languages (e.g. Java).
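
The version-counter technique mentioned under Reliability can be sketched as follows. This is a minimal illustration, assuming a 32-bit payload packed together with a 32-bit counter into one 64-bit word so that an ordinary single-word CAS covers both; the class name VersionedWord and its methods are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A 32-bit value and a 32-bit version counter stored in one 64-bit word.
class VersionedWord {
public:
    explicit VersionedWord(std::uint32_t v) : word_(pack(v, 0)) {}

    // Reads value and version together in one atomic load.
    std::uint64_t snapshot() const { return word_.load(); }

    static std::uint32_t value_of(std::uint64_t s) {
        return static_cast<std::uint32_t>(s);
    }
    static std::uint32_t version_of(std::uint64_t s) {
        return static_cast<std::uint32_t>(s >> 32);
    }

    // Succeeds only if neither value nor version changed since snap was
    // taken. Every successful update increments the version, so an
    // A -> B -> A sequence no longer looks unchanged. The counter is
    // bounded: after 2^32 updates it wraps around, which is the residual
    // risk described in the text.
    bool cas(std::uint64_t snap, std::uint32_t new_value) {
        std::uint64_t desired = pack(new_value, version_of(snap) + 1);
        return word_.compare_exchange_strong(snap, desired);
    }

private:
    static std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
        return (static_cast<std::uint64_t>(version) << 32) | value;
    }
    std::atomic<std::uint64_t> word_;
};
```

The price is that half of the word is spent on the counter, which is why algorithms that store full-width items alongside a counter need a double-width CAS.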

3.1 Overview 

The following sections give a systematic overview of the research results in
the literature. For a more in-depth look and a case study in the design of a
lock-free data structure and how it can be used in practice, we refer the
reader to our chapter in "GPU Computing Gems" [TOl, which describes
in detail how to implement a lock-free work-stealing deque and the reasoning
behind the design decisions.

3.2 Producer-Consumer Collections 

A common approach to parallelizing applications is to divide the problem into 
separate threads that act as either producers or consumers. The problem of 
synchronizing these threads and streaming of data items between them, can be 
alleviated by utilizing a shared collection data structure. 

Bag 

The Bag abstract data type is a collection of items in which items can be 
stored and retrieved in any order. Basic operations are Add (add an item) and 
TryRemoveAny (remove an arbitrarily chosen item). TryRemoveAny returns the
item removed. Data structures with similar semantics are also called buffer, 
unordered collection, unordered queue, pool, and pile in the literature. 

All lock-free stacks, queues and deques implicitly implement the selected
bag semantics. Afek et al. [T] presented an explicit pool data structure. It is
lock-free, although not linearizable, utilizes distributed storage and is based on
randomization to establish a probabilistic level of disjoint-access-parallelism.

In [17] a data structure called flat-sets was introduced and used as a
building block in the concurrent memory allocation service. This is a bag-like 
structure that supports lock-free insertion and removal of items as well as an 
"inter-object" operation, for moving an item from one flat-set to another in a 
lock-free and linearizable manner, thus offering the possibility of combining data 
structures. 

In [83] a lock-free bag implementation is presented; the algorithm supports
multiple producers and multiple consumers, as well as dynamic collection sizes. 
To handle concurrency efficiently, the algorithm was designed to optimize for 
disjoint-access-parallelism for the supported semantics. 

Stack 

The Stack abstract data type is a collection of items in which only the most 
recently added item may be removed. The latest added item is at the top. Basic 
operations are Push (add to the top) and Pop (remove from the top). Pop returns 
the item removed. The data structure is also known as a "last-in, first-out" or 
LIFO buffer. 

Treiber presented a lock-free stack (a.k.a. IBM Freelist) based on linked
lists, whose ABA problem was later efficiently fixed by Michael [M] .
Valois [15] also presented a lock-free implementation that uses the CAS atomic
primitive. Hendler et al. [41] presented an extension where randomization and
elimination are used for increasing scalability when contention is detected on
the CAS attempts.
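
A linked-list stack in the style attributed to Treiber above can be sketched in a few lines of C++. This is a simplified illustration rather than any of the cited algorithms: the class name TreiberStack is hypothetical, and popped nodes are deliberately never freed or reused, which sidesteps both the ABA problem and unsafe memory reclamation at the cost of leaking memory.

```cpp
#include <atomic>
#include <cassert>

template <typename T>
class TreiberStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> top_{nullptr};

public:
    void push(const T& v) {
        Node* n = new Node{v, top_.load()};
        // If top_ changed since it was read, compare_exchange_weak stores
        // the current top in n->next, so the loop simply retries.
        while (!top_.compare_exchange_weak(n->next, n)) {
        }
    }

    // Returns false if the stack was empty, otherwise stores the item in out.
    bool pop(T& out) {
        Node* old_top = top_.load();
        // On failure old_top is refreshed with the current top.
        while (old_top && !top_.compare_exchange_weak(old_top, old_top->next)) {
        }
        if (!old_top) return false;
        out = old_top->value;
        // Assumption of this sketch: old_top is intentionally leaked.
        // Freeing it here would be unsafe, because a concurrent pop may
        // still be reading old_top->next.
        return true;
    }
};
```

A production version would combine this with one of the memory reclamation schemes discussed later in the chapter instead of leaking nodes.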

Queue 

The Queue abstract data type is a collection of items in which only the earliest 
added item may be accessed. Basic operations are Enqueue (add to the tail) and
Dequeue (remove from the head). Dequeue returns the item removed. The data
structure is also known as a "first-in, first-out" or FIFO buffer. 

Lamport [55] presented a lock-free (actually wait-free) implementation of a
queue based on a static array, with limited concurrency supporting only one
producer and one consumer. Giacomoni et al. [23] presented a cache-aware
modification which, instead of using shared head and tail indices, synchronizes
directly on the array elements. Herman and Damian-Iordache [46] outlined a
wait-free implementation of a shared queue for any number of threads, although
it is impractical due to its high time complexity and limited capacity.
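
The single-producer, single-consumer array queue described above needs no CAS at all, only atomic loads and stores of the two indices. The following is a sketch in that spirit, not a transcription of the cited algorithm: the class name SpscQueue is hypothetical, and the acquire/release memory orderings are an adaptation to the C++11 memory model.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Ring buffer for exactly one producer thread and one consumer thread.
// One slot is kept empty to tell "full" from "empty", so capacity is N - 1.
template <typename T, std::size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // next slot to read; written by consumer only
    std::atomic<std::size_t> tail_{0};  // next slot to write; written by producer only

public:
    // Producer only. Returns false if the queue is full.
    bool enqueue(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = v;
        // Release publishes the written slot before the new tail is visible.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer only. Returns false if the queue is empty.
    bool dequeue(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h];
        // Release hands the slot back to the producer after it has been read.
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

Each operation completes in a bounded number of its own steps, which is what makes this design wait-free, and the restriction to one producer and one consumer is what lets it avoid CAS.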

Gong and Wing [5^ and later Shann et al. [75] presented a lock-free shared 
queue based on a cyclic array and the CAS primitive, though with the drawback 
of using version counters, thus requiring double-width CAS for storing actual 
items. Tsigas and Zhang [50] presented a lock-free extension of [SH] for any 
number of threads where synchronization is done both on the array elements 
and the shared head and tail indices using CAS, and the ABA problem is avoided
by exploiting two (or more) null values. 

Valois inn inS] makes use of linked lists in his lock-free implementation which
is based on the CAS primitive. Prakash et al. [73] also presented an
implementation using linked lists and the CAS primitive, although with the drawback of
using version counters and having low scalability. Michael and Scott [55]
presented a lock-free queue that is more efficient, synchronizing via the shared head
and tail pointers as well as via the next pointer of the last node. Moir et al. [72]
presented an extension where elimination is used as a back-off strategy when
contention on CAS is noticed, although elimination is only possible when the
queue contains very few items. Hoffman et al. [47] take another approach to
a back-off strategy by allowing concurrent Enqueue operations to insert the new
node at adjacent positions in the linked list if contention is noticed. Gidenstam
et al. [28] combine the efficiency of using arrays and the dynamic capacity of
using linked lists, by providing a lock-free queue based on linked lists of arrays,
all updated using CAS in a cache-aware manner.

Deque 

The Deque (or double-ended queue) abstract data type is a combination of the
stack and the queue abstract data types. The data structure is a collection of 
items in which the earliest as well as the latest added item may be accessed. 
Basic operations are PushLeft (add to the head), PopLeft (remove from the head), 
PushRight (add to the tail), and PopRight (remove from the tail). PopLeft and 
PopRight return the item removed. 

Considerable effort has been devoted to so-called work-stealing deques.
These data structures support only three operations, with a limited level
of concurrency, and are specifically aimed at scheduling purposes. Arora et al.
[3] presented a lock-free work-stealing deque implementation based on the CAS
atomic primitive. Hendler et al. [40] improved this algorithm to also handle
dynamic sizes.

Several lock-free implementations of the deque abstract data type for general
purposes, although based on the non-available CAS2 atomic primitive, have been
published in the literature [211 [21 [131 \SS\ IS] . Michael [53] presented a lock-free
deque implementation based on the CAS primitive, although not supporting any
level of disjoint-access-parallelism. Sundell and Tsigas [88] presented a lock-free
implementation that allows both disjoint-access-parallelism and dynamic
sizes using the standard CAS atomic primitive.

Priority Queue 

The Priority Queue abstract data type is a collection of items which can effi- 
ciently support finding the item with the highest priority. Basic operations are 
Insert (add an item), FindMin (finds the item with minimum (or maximum) pri- 
ority), and DeleteMin (removes the item with minimum (or maximum) priority). 
DeleteMin returns the item removed. 

Israeli and Rappoport [32] have presented a wait-free algorithm for a shared
priority queue that requires the non-available multi-word LL/SC atomic
primitives. Greenwald [3T] has presented an outline for a lock-free priority queue
based on the non-available CAS2 atomic primitive. Barnes [7] presented an
incomplete attempt at a lock-free implementation that uses atomic primitives
available on contemporary systems. Sundell and Tsigas [57] presented the first
lock-free implementation of a priority queue based on skip lists and the CAS
atomic primitive.

3.3 Lists 

The List abstract data type is a collection of items where two items are related 
only with respect to their relative position to each other. The data structure 
should efficiently support traversals among the items. Depending on the type
of the underlying data structure, e.g. arrays or linked lists, different strengths
of traversal functionality are supported.

Array 

List implementations based on the fundamental array data structure can
support traversals to absolute index positions. Higher-level abstractions such as
extendable arrays additionally support stack semantics. Consequently, the Array
abstract data type would support the operations ReadAt (read the element at
index), WriteAt (write the element at index), Push (add to the top) and Pop (remove
from the top). Pop returns the item removed.

A lock-free extendable array for practical purposes has been presented by
Dechev et al. [T^ .

Linked List 

In a concurrent environment with List implementations based on linked lists, 
traversals to absolute index positions are not feasible. Consequently, traversals 
are only supported relatively to a current position. The current position is 
maintained by the cursor concept, where each handle (i.e. thread or process) 
maintains one independent cursor position. The first and last cursor positions 
do not refer to real items, but are instead used as end markers, i.e. before the 
first item or after the last item. Basic operations are InsertAfter (add a new item
after the current), Delete (remove the current item), Read (inspect the current
item), Next (traverse to the item after the current), and First (traverse to the
position before the first item). Additional operations are InsertBefore (add a new item
before the current), Previous (traverse to the item before the current), and Last
(traverse to the position after the last item).

Lock-free implementations of the singly-linked list based on the CAS atomic
primitive and with semantics suitable for the Dictionary abstract type rather
than the List have been presented by Harris [39], Michael [61], and Fomitchev and
Ruppert [IH] . Greenwald [32] presented a doubly-linked list implementation of a
dictionary based on the non-available CAS2 atomic primitive. Attiya and Hillel
[6] presented a CAS2-based implementation that also supports
disjoint-access-parallelism. Valois [Hi] outlined a lock-free doubly-linked list implementation
with all list semantics except delete operations. A more general doubly-linked
list implementation supporting general list semantics was presented by Sundell
and Tsigas |HH].

3.4 Sets and Dictionaries 

The Set abstract data type is a collection of special items called keys, where each 
key is unique and can have at most one occurrence in the set. Basic operations 
are Add (adds the key), ElementOf (checks if key is present), and Delete (removes 
the key). 

The Dictionary abstract data type is a collection of items where each item 
is associated with a unique key. The data structure should efficiently support 
finding the item associated with the specific key. Basic operations are Insert (add
an item associated with a key), Find (finds the item associated with a certain
key), and Delete (removes the item associated with a certain key). Delete returns
the item removed. In concurrent environments, an additional basic operation is 
the Update (re-assign the association of a key with another item) operation. 

Implementations of Sets and Dictionaries are often closely related, in that
most implementations of a set can be extended to also support dictionary
semantics in a straightforward manner. However, the Update operation mostly
needs specific care in the fundamental part of the algorithmic design to be
linearizable. Non-blocking implementations of sets and dictionaries are mostly
based on hash-tables or linked lists as done by Valois [95]. The path using
concurrent linked lists was improved by Harris [39]. Other means to implement
sets and dictionaries are the skip-list and tree data structures.

Skip-List 

Valois [95] outlined an incomplete idea of how to design a concurrent skip list.
Sundell and Tsigas presented a lock-free implementation of a skip list in the
scope of priority queues [SSJ [57] as well as dictionaries [HSl ISI] using the CAS
primitive. Similar constructions have appeared in the literature by Fraser [20]
and Fomitchev and Ruppert ^TE\ .

Hash-Table

Michael [61] presented a lock-free implementation of the set abstract data type
based on a hash-table with its chaining handled by an improved linked list
compared to [39]. To a large part, its high efficiency is thanks to the
memory management scheme applied. The algorithm was improved by Shalev and
Shavit [77] to also handle dynamic sizes of the hash-table's underlying array data
structure. Greenwald [32] has presented a dictionary implementation based on
chained hash-tables and the non-available CAS2 atomic primitive.

Gao et al. [H] presented a lock-free implementation of the dictionary
abstract data type based on a hash-table data structure using open addressing.
The hash-table is fully dynamic in size, although its efficiency is limited by its
relatively complex memory management.

Tree 

Tsay and Li [89] presented an approach for designing lock-free implementations of
a tree data structure using the LL/SC atomic primitives and extensive copying
of data. However, the algorithm is not accompanied by sufficient evidence for
showing linearizability. Ellen et al. [17] presented a lock-free implementation of
the set abstract data type based on a binary tree data structure using the CAS
atomic primitive. Spiegel and Reynolds [80] presented a lock-free implementation
of the set abstract data type based on a skip-tree and the CAS atomic primitive.

4 Memory Management for Concurrent Data Structures

The problem of managing dynamically allocated memory in a concurrent
environment has two parts: keeping track of the free memory available for allocation,
and safely reclaiming allocated memory when it is no longer in use, i.e. memory
allocation and memory reclamation.

4.1 Memory Allocation 

A memory allocator manages a pool of memory (heap), e.g. a contiguous range 
of addresses or a set of such ranges, keeping track of which parts of that memory 
are currently given to the application and which parts are unused and can be 
used to meet future allocation requests from the application. A traditional
general-purpose memory allocator (such as the "libc" malloc) is not allowed to move
or otherwise disturb memory blocks that are currently owned by the application.

Some of the most important properties that distinguish memory allocators 
for concurrent applications in the literature are: 

Fragmentation To minimize fragmentation is to minimize the amount of free 
memory that cannot be used (allocated) by the application due to the size of 
the memory blocks. 

False-sharing False sharing is when different parts of the same cache-line are 
allocated to separate objects that end up being used by threads running on 
different processors. 

Efficiency and scalability The concurrent memory allocator should be as fast 
as a good sequential one when executed on a single processor and its performance 
should scale with the load in the system. 
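
The false-sharing property can be illustrated with a minimal C++ sketch. It assumes a 64-byte cache line, which is common but platform-dependent, and the names kCacheLine, PaddedCounter, PackedCounter and same_cache_line are illustrative only; they are not taken from any of the allocators discussed here. 

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Aligned (and therefore padded) to a full cache line, so two adjacent
// counters used by different threads never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

// Unpadded layout for comparison: adjacent counters can share a line.
struct PackedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "padding should round the size up to one cache line");

// Returns true if both addresses fall on the same (assumed) cache line.
inline bool same_cache_line(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    };
    return line(a) == line(b);
}
```

An allocator that hands out blocks in the style of PackedCounter to different threads risks false sharing; rounding blocks up to cache-line granularity avoids it at the cost of more internal fragmentation. 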

Here we focus on lock-free memory allocators but there is also a considerable 
number of lock-based concurrent memory allocators in the literature. 
Early work on lock-free memory allocation is the work on non-blocking 
operating systems by Massalin and Pu [60] [59] and Greenwald and Cheriton [33] [31]. 
Dice and Garthwaite [TS] presented LFMalloc, a memory allocator based on 
the architecture of the Hoard lock-based concurrent memory allocator [S] but 
with reduced use of locks. Michael [5^ presented a fully lock-free allocator, also 
loosely based on the Hoard architecture. Gidenstam et al. [55] presented NBmal- 
loc, another lock-free memory allocator loosely based on the Hoard architecture. 
NBmalloc is designed from the requirement that the first-remove-then-insert ap- 
proach to moving references to large internal blocks of memory (superblocks) 
around should be avoided and therefore introduces and uses a move operation 
that can move a reference between different internal data-structures atomically. 
Schneider et al. [76] presented Streamflow, a lock-free memory allocator that 
has improved performance over previous solutions due to allowing thread local 
allocations and deallocations without synchronization. 

4.2 Memory Reclamation 

To manage dynamically allocated memory in non-blocking algorithms is difficult 
due to overlapping operations that might read, change or dereference (i.e. fol- 
low) references to dynamically allocated blocks of memory concurrently. One of 
the most problematic cases is when a slow process dereferences a pointer value 
that it previously read from a shared variable. This dereference of the pointer 
value could occur an arbitrarily long time after the shared pointer holding that 
value was overwritten and the memory designated by the pointer removed from 
the shared data structure. Consequently it is impossible to safely free or reuse 
the block of memory designated by this pointer value until we are sure that 
there are no such slow processes with pointers to that block. 
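
The slow-reader problem described above can be made concrete with a small sketch in the style of hazard-pointer schemes, where each thread publishes the pointer it is about to dereference and a node is freed only when no thread has published it. This is a simplified illustration under stated assumptions: a fixed thread bound (kMaxThreads), one published slot per thread, and a caller-owned list of retired nodes. The names protect, release and reclaim are hypothetical and do not reproduce any specific scheme from the literature. 

```cpp
#include <atomic>
#include <cassert>
#include <vector>

constexpr int kMaxThreads = 4;  // assumed fixed upper bound on threads

struct Node { int value; };

// One published "I am about to dereference this" slot per thread.
std::atomic<Node*> g_hazard[kMaxThreads];

// Read a shared pointer and publish it before dereferencing. The re-read
// detects the case where the shared pointer changed (and the node may
// have been retired) between the first read and the publication.
Node* protect(const std::atomic<Node*>& shared, int tid) {
    assert(tid >= 0 && tid < kMaxThreads);
    Node* p = shared.load();
    for (;;) {
        g_hazard[tid].store(p);
        Node* again = shared.load();
        if (again == p) return p;
        p = again;
    }
}

// Withdraw the published pointer once the thread is done with the node.
void release(int tid) {
    assert(tid >= 0 && tid < kMaxThreads);
    g_hazard[tid].store(nullptr);
}

// Free every retired node that no thread has published; keep the rest
// for a later attempt. Nodes must already be unlinked from the shared
// structure before they are put on the retired list.
int reclaim(std::vector<Node*>& retired) {
    int freed = 0;
    std::vector<Node*> still_in_use;
    for (Node* n : retired) {
        bool hazardous = false;
        for (int t = 0; t < kMaxThreads; ++t)
            if (g_hazard[t].load() == n) hazardous = true;
        if (hazardous) {
            still_in_use.push_back(n);
        } else {
            delete n;
            ++freed;
        }
    }
    retired.swap(still_in_use);
    return freed;
}
```

The sketch protects only the pointer a thread has published; it says nothing about pointers stored inside a protected node. 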

There are several reclamation schemes in the literature with a wide and 
varying range of properties: 

I. Safety of local references For local references, which are stored in private 
variables accessible only by one thread, to be safe the memory reclamation 
scheme must guarantee that a dynamically allocated node is never reclaimed 
while there still are local references pointing to it. 

II. Safety of shared references Additionally, a memory reclamation scheme 
could also guarantee that it is always safe for a thread to dereference any shared 
references located within a dynamic node the thread has a local reference to. 
Property I alone does not guarantee this, since for a node that has been deleted 
but cannot be reclaimed yet any shared references within it could reference 
nodes that have been deleted and reclaimed since the node was removed from 
the data structure. 

III. Automatic or explicit deletion A dynamically allocated node could 
either be reclaimed automatically when it is no longer accessible through any 
local or shared reference, that is, the scheme provides automatic garbage 
collection, or the user algorithm or data structure could be required to explicitly 
tell the memory reclamation scheme when a node is removed from the active 
data structure and should be reclaimed as soon as it has become safe. While 
automatic garbage collection is convenient for the user, explicit deletion by the 
user gives the reclamation scheme more information to work with and can help 
to provide stronger guarantees, e.g. bounds on the amount of deleted but yet 
unreclaimed memory. 

Scheme                      Property II  Property III  Property IV  Property V 
Michael [Bl [M]             No           Explicit      Yes          Yes 
Herlihy et al. [43]         No           Explicit      Yes          No 
Valois et al. [95] [67]     Yes          Automatic     No           Yes 
Detlefs et al. [14]         Yes          Automatic     Yes          No 
Herlihy et al. [42]         Yes          Automatic     Yes          No 
Gidenstam et al. [21] [25]  Yes          Explicit      Yes          Yes 
Fraser [20]                 Yes          Explicit      Yes          Yes 
Herlihy et al. [45]         Yes          Automatic     Integrated   Yes 
Gao et al. [22]             Yes          Automatic     Integrated   Yes 

Table 1: Properties of different approaches to non-blocking memory reclamation. 

IV. Requirements on the memory allocator Some memory reclamation 
schemes require special properties from the memory allocator, like, for example, 
that each allocable node has a permanent (i.e. for the rest of the system's 
lifetime) reference counter associated with it. Other schemes are compatible 
with the well-known and simple allocate/free allocator interface where the node 
has ceased to exist after the call to free. 

V. Required synchronization primitives Some memory reclamation schemes 
are defined using synchronization primitives that few if any current processor 
architectures provide in hardware, such as double-word CAS, which 
then has to be implemented in software, often adding considerable overhead. 
Other schemes make do with single word CAS, single word LL/SC or even just 
reads and writes alone. 

The properties of the memory reclamation schemes discussed here are 
summarized in Table 1. One of the most important is Property II, which many 
lock-free algorithms and data structures need. Among the memory reclamation 
schemes that guarantee Property II we have the following ones, all based on 
reference counting: Valois et al. [95] [67], Detlefs et al. [14], Herlihy et al. [42] 
and Gidenstam et al. [21] [25], and the potentially blocking epoch-based scheme 
by Fraser [20]. 

On the other hand, for data structures that do not need Property II, for 
example stacks, the use of a reclamation scheme that does not provide this 
property has significant potential to offer reduced overhead compared to the 
stronger schemes. Among these memory reclamation schemes we have the 
non-blocking ones by Michael [BH [M] and Herlihy et al. [43]. 

Fully Automatic Garbage Collection A fully automatic garbage collector 
provides properties I, II and III with automatic deletion. 

There are some lock-free garbage collectors in the literature. Herlihy and 
Moss presented a lock-free copying garbage collector in [iS]. Gao et al. [H] 
presented a lock-free Mark & Sweep garbage collector and Kliot et al. [53] presented 
a lock-free stack scanning mechanism for concurrent garbage collectors. 

5 Graphics Processors 

Currently the two most popular programming environments for general purpose 
computing for graphics processors are CUDA and OpenCL. Neither provides any 
direct support for locks, and it is unlikely that this will change in the future. 
Concurrent data structures that are used on graphics processors will therefore 
have to be lock-free. 

While graphics processors share many features with conventional processors, 
and many lock-free algorithms can be ported directly, there are some differences 
that are important to consider, if one also wants to maintain or improve the 
scalability and throughput of the algorithms. 

5.1 Data Parallel Model 

A graphics processor consists of a number of multiprocessors that can execute 
the same instruction on multiple data, known as SIMD computing. Concur- 
rent data structures are, as the name implies, designed to support multiple 
concurrent operations, but when used on a multiprocessor they also need to 
support concurrent instructions within an operation. This is not straightfor- 
ward, as most have been designed for scalar processors. Considering that SIMD 
instructions play an instrumental role in the parallel performance offered by the 
graphics processor, it is imperative that this issue be addressed. 

Graphics processors have a wide memory bus and a high memory bandwidth, 
which makes it possible to quickly transfer data from the memory to the 
processor and back. The hardware is also capable of coalescing multiple small 
memory operations into a single, large, atomic memory operation. As a single 
large memory operation can be performed faster than many small ones, this should 
be taken advantage of in the algorithmic design of the data structure. 

The cache in graphics processors is smaller than on conventional SMP pro- 
cessors and in many cases non-existent. The memory latency is instead masked 
by utilizing thousands of threads and by storing data temporarily in a high-speed 
multiprocessor local memory area. The high number of threads reinforces the 
importance of the data structure being highly scalable. 

The scheduling of threads on a graphics processor is commonly performed 
by the hardware. Unfortunately, the scheme used is often undocumented, 
so there is no guarantee that it will be fair. This makes the use 
of algorithms with blocking behavior risky. For example, a thread holding a 
lock could be indefinitely swapped out in favor of another thread waiting for 
the same lock, resulting in a livelock situation. Lock-freeness is thus a must. 

Of a more practical concern is the fact that a graphics processor often lacks 
stacks, making recursive operations more difficult. The lack of a joint address 
space between the GPU and the CPU also complicates the move of data from 
the CPU to the graphics processor, as all pointers in the data structure have to 
be rebased when moved to a new address. 

5.2 New Algorithmic Design 

The use of SIMD instructions means that if multiple threads write to the same 
memory location, only one (arbitrary) thread can succeed. Thus, allowing 
threads that will be combined into a SIMD unit by the hardware to concurrently 
try to enqueue an item to the same position in a queue will in all likelihood 
be unnecessarily expensive, as only one thread can succeed in enqueuing 
its item. Instead, by first combining the operations locally, and then trying to 
insert all elements in one step, this problem can be avoided. This is a technique 
used by XMalloc, a lock-free memory allocator for graphics processors [35]. On 
data structures with more disjoint memory access than a queue, the problem is 
less pronounced, as multiple operations can succeed concurrently if they access 
different parts of the memory. 
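
The idea of combining operations locally and inserting them in one step can be sketched in C++ as follows. This is an illustration under stated assumptions, not the XMalloc algorithm: a group of already collected items reserves a contiguous range of slots with a single successful CAS on the tail instead of one contended CAS per item. The names BatchQueue, kCapacity and enqueue_batch are hypothetical, the array does not wrap around, and publication of the written slots to consumers (for example through per-slot ready flags) is left out. 

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCapacity = 1024;  // assumed fixed capacity

struct BatchQueue {
    int slots[kCapacity];
    std::atomic<std::size_t> tail{0};  // index of the first unreserved slot
};

// Reserve n consecutive slots with one successful CAS, then fill them.
// Returns the index of the first reserved slot, or kCapacity if the
// batch does not fit.
std::size_t enqueue_batch(BatchQueue& q, const int* items, std::size_t n) {
    assert(items != nullptr && n > 0 && n <= kCapacity);
    std::size_t first = q.tail.load();
    do {
        if (first + n > kCapacity) return kCapacity;
        // On failure compare_exchange_weak reloads `first` with the
        // current tail, so the loop retries with fresh information.
    } while (!q.tail.compare_exchange_weak(first, first + n));
    // The range [first, first + n) now belongs exclusively to this batch.
    for (std::size_t i = 0; i < n; ++i) q.slots[first + i] = items[i];
    return first;
}
```

On a graphics processor the combining step would typically happen among the threads of one SIMD unit, with a single thread performing the reservation on behalf of the group. 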

One way to take advantage of the SIMD instructions and memory coalescing 
is to allow each node in a tree to have more children. This makes the tree 
shallower and lowers the number of nodes that need to be checked when 
searching for an item. As a consequence, the time spent in each node will 
increase, but with 
coalesced memory access and SIMD instructions, this increase in time spent 
can be limited by selecting the number of children to suit the SIMD instruction 
size. The node can then be read in a single memory operation and the correct 
child can be found using just two SIMD compare instructions. 
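
A minimal C++ sketch of such a wide node is given below. The fan-out of 16 is an assumption standing in for the SIMD width, and WideNode, child_index and find_leaf are illustrative names. The child is selected by counting how many separator keys are less than or equal to the search key; the loop has no data-dependent branches, so a compiler or a hand-written SIMD version can evaluate the comparisons in parallel. 

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kFanout = 16;  // assumed to match the SIMD width

struct WideNode {
    int keys[kFanout - 1];        // sorted separator keys
    WideNode* children[kFanout];  // children[i] covers keys in
                                  // [keys[i-1], keys[i])
};

// Count the separators that are <= key; the count is the child index.
std::size_t child_index(const WideNode& node, int key) {
    std::size_t index = 0;
    for (std::size_t i = 0; i < kFanout - 1; ++i)
        index += (node.keys[i] <= key) ? 1 : 0;
    return index;
}

// Descend from the root until the selected child pointer is null.
const WideNode* find_leaf(const WideNode* root, int key) {
    assert(root != nullptr);
    const WideNode* node = root;
    while (const WideNode* next = node->children[child_index(*node, key)])
        node = next;
    return node;
}
```

With 15 keys of 4 bytes each, the key array fits in 60 bytes, so under the common assumption of a 64-byte memory transaction the keys of a node can be fetched together. 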

Another suggestion is to use memory coalescing to implement lazy 
operations, where larger read and write operations replace a percentage of expensive 
CAS operations. An array-based queue, for example, does not need to update 
its tail pointer using CAS every time an item is inserted. Instead it could be 
updated every x-th operation, and the correct tail could be found by quickly 
traversing the array using large memory reads and SIMD instructions, reducing 
the traversal time to a low static cost. This type of lazy updating was used in 
the queue by Tsigas and Zhang [91]. 
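
The lazy tail update can be sketched in C++ as follows. This is a simplified, insert-only illustration and not the queue of Tsigas and Zhang: the array does not wrap around, the value 0 is assumed to mark an empty slot, and the names LazyQueue, kHintInterval and enqueue are hypothetical. Each insertion still uses one CAS on its slot, but the shared tail is only a hint that is refreshed with a plain store on every kHintInterval-th slot. 

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSlots = 256;       // assumed capacity, no wraparound
constexpr std::size_t kHintInterval = 8;  // the "x" in "every x-th operation"
constexpr int kEmpty = 0;                 // assumed sentinel; items are non-zero

struct LazyQueue {
    std::atomic<int> slots[kSlots];
    std::atomic<std::size_t> tail_hint{0};  // never beyond the true tail
    LazyQueue() {
        for (auto& s : slots) s.store(kEmpty);
    }
};

// Returns the index of the slot used, or kSlots if the array is full.
std::size_t enqueue(LazyQueue& q, int item) {
    assert(item != kEmpty);
    // Start at the possibly stale hint and scan forward to the real tail.
    for (std::size_t i = q.tail_hint.load(); i < kSlots; ++i) {
        int expected = kEmpty;
        if (q.slots[i].compare_exchange_strong(expected, item)) {
            // Refresh the hint only occasionally. A late store may move it
            // backwards, which costs extra scanning but stays correct,
            // because every slot before a filled slot is also filled.
            if (i % kHintInterval == 0) q.tail_hint.store(i);
            return i;
        }
    }
    return kSlots;
}
```

The scan over at most kHintInterval slots is the part that the text suggests performing with large coalesced reads and SIMD comparisons. 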

The coalescing memory access mechanism also directly influences the syn- 
chronization capabilities of the graphics processor. It has for example been 
shown that it can be used to facilitate wait-free synchronization between threads, 
without the need of synchronization primitives other than reads and writes 
[35] [37]. 

When it comes to software-controlled load balancing, there have been experi- 
ments made comparing the built-in hardware scheduler with a software managed 
work-stealing approach [Q . It was shown that lock-free implementations of data 
structures worked better than lock-based ones, and that lock-free work-stealing could 
outperform the built-in scheduler. 

The lack of a stack can be a significant problem for data structures that 
require recursive helping for lock-freeness. While it is often possible to rewrite 
recursive code to work iteratively instead, it requires that the recursion depth 
can be bounded, in order to limit the amount of memory that needs to be allocated. 
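
As a generic illustration of rewriting recursion into iteration with a bounded depth, the following C++ sketch traverses a binary tree using a fixed-size explicit stack. It does not implement recursive helping itself; TreeNode, kMaxDepth and sum_iterative are hypothetical names, and the depth bound of 32 is an assumption that the caller must guarantee. 

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxDepth = 32;  // assumed bound on the tree depth

struct TreeNode {
    int value;
    TreeNode* left;
    TreeNode* right;
};

// Sum all values in pre-order without recursion. At most one pending
// right subtree is stored per level of the current path, so the stack
// never needs more than kMaxDepth entries for a tree within the bound.
long sum_iterative(const TreeNode* root) {
    const TreeNode* stack[kMaxDepth];
    std::size_t top = 0;
    long sum = 0;
    const TreeNode* node = root;
    while (node != nullptr || top > 0) {
        if (node == nullptr) node = stack[--top];
        sum += node->value;
        if (node->right != nullptr) {
            assert(top < kMaxDepth);  // the depth bound would be violated
            stack[top++] = node->right;
        }
        node = node->left;
    }
    return sum;
}
```

Because the stack size is a compile-time constant, the memory for it can be reserved up front, which is the property that matters when no hardware stack is available. 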